from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
!pip install h2o
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting h2o
Downloading h2o-3.40.0.2.tar.gz (177.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 177.6/177.6 MB 4.9 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from h2o) (2.25.1)
Requirement already satisfied: tabulate in /usr/local/lib/python3.9/dist-packages (from h2o) (0.8.10)
Requirement already satisfied: future in /usr/local/lib/python3.9/dist-packages (from h2o) (0.16.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (2022.12.7)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (4.0.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests->h2o) (1.26.15)
Building wheels for collected packages: h2o
Building wheel for h2o (setup.py) ... done
Created wheel for h2o: filename=h2o-3.40.0.2-py2.py3-none-any.whl size=177693439 sha256=c889ea192bd3ffe65aff7da84c509223a41d1df2158b3e6621b0639ec5aa4c63
Stored in directory: /root/.cache/pip/wheels/b2/79/e3/842b81607eb31946ee24898cc9961b101e6486f988a5103967
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.40.0.2
import h2o
from h2o.automl import H2OAutoML
h2o.init()
# Import data
f = "/content/drive/MyDrive/Qualities in Intelligent Systems/Bike-Sharing-Dataset/day.csv"
df = h2o.import_file(f)
# Response column
y = "cnt"
# Split into train & test
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)
# Explain leader model & compare with all AutoML models
exa = aml.explain(test)
# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)
Checking whether there is an H2O instance running at http://localhost:54321..... not found. Attempting to start a local H2O server... Java Version: openjdk version "11.0.18" 2023-01-17; OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1); OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing) Starting server from /usr/local/lib/python3.9/dist-packages/h2o/backend/bin/h2o.jar Ice root: /tmp/tmp7fslgtys JVM stdout: /tmp/tmp7fslgtys/h2o_unknownUser_started_from_python.out JVM stderr: /tmp/tmp7fslgtys/h2o_unknownUser_started_from_python.err Server is running at http://127.0.0.1:54321 Connecting to H2O server at http://127.0.0.1:54321 ... successful.
| H2O_cluster_uptime: | 02 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.40.0.2 |
| H2O_cluster_version_age: | 7 days, 23 hours and 54 minutes |
| H2O_cluster_name: | H2O_from_python_unknownUser_kavmgw |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.172 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://127.0.0.1:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| Python_version: | 3.9.16 final |
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
The leaderboard shows models with their metrics. When given an H2OAutoML object, it shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the provided frame. At most 20 models are shown by default.
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| StackedEnsemble_BestOfFamily_3_AutoML_1_20230317_151130 | 98.106 | 9624.79 | 67.2775 | 0.0329657 | 9624.79 | 240 | 0.232783 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_1_20230317_151130 | 98.106 | 9624.79 | 67.2775 | 0.0329657 | 9624.79 | 288 | 0.069139 | StackedEnsemble |
| StackedEnsemble_AllModels_1_AutoML_1_20230317_151130 | 99.3945 | 9879.27 | 67.246 | 0.0311356 | 9879.27 | 566 | 0.18038 | StackedEnsemble |
| StackedEnsemble_AllModels_2_AutoML_1_20230317_151130 | 99.9931 | 9998.62 | 67.6545 | 0.0312209 | 9998.62 | 283 | 0.167626 | StackedEnsemble |
| GBM_3_AutoML_1_20230317_151130 | 103.66 | 10745.5 | 69.3182 | 0.0348153 | 10745.5 | 642 | 0.034323 | GBM |
| GBM_4_AutoML_1_20230317_151130 | 143.498 | 20591.7 | 95.6842 | 0.0507063 | 20591.7 | 993 | 0.044029 | GBM |
| StackedEnsemble_BestOfFamily_1_AutoML_1_20230317_151130 | 158.688 | 25182 | 107.911 | 0.0539144 | 25182 | 774 | 0.044236 | StackedEnsemble |
| XGBoost_1_AutoML_1_20230317_151130 | 166.693 | 27786.6 | 117.504 | 0.0562133 | 27786.6 | 618 | 0.014036 | XGBoost |
| GBM_2_AutoML_1_20230317_151130 | 177.254 | 31418.8 | 120.446 | 0.0624025 | 31418.8 | 1162 | 0.027619 | GBM |
| XGBoost_grid_1_AutoML_1_20230317_151130_model_1 | 192.431 | 37029.7 | 147.774 | 0.0930426 | 37029.7 | 251 | 0.013005 | XGBoost |
| GBM_5_AutoML_1_20230317_151130 | 202.543 | 41023.8 | 131.344 | 0.070276 | 41023.8 | 324 | 0.023635 | GBM |
| XGBoost_2_AutoML_1_20230317_151130 | 206.521 | 42651 | 155.068 | 0.0683256 | 42651 | 659 | 0.012363 | XGBoost |
| XGBoost_grid_1_AutoML_1_20230317_151130_model_2 | 217.912 | 47485.4 | 168.958 | 0.0660485 | 47485.4 | 109 | 0.010368 | XGBoost |
| DRF_1_AutoML_1_20230317_151130 | 250.339 | 62669.4 | 176.565 | 0.0969177 | 62669.4 | 801 | 0.01938 | DRF |
| XRT_1_AutoML_1_20230317_151130 | 298.545 | 89129.4 | 203.117 | 0.111638 | 89129.4 | 300 | 0.008238 | DRF |
| GBM_1_AutoML_1_20230317_151130 | 318.378 | 101364 | 243.11 | 0.135626 | 101364 | 1574 | 0.020177 | GBM |
| GLM_1_AutoML_1_20230317_151130 | 344.39 | 118604 | 257.38 | 0.129314 | 118604 | 100 | 0.004387 | GLM |
| GBM_grid_1_AutoML_1_20230317_151130_model_1 | 380.558 | 144824 | 260.046 | 0.141757 | 144824 | 251 | 0.019313 | GBM |
| DeepLearning_1_AutoML_1_20230317_151130 | 396.334 | 157080 | 292.219 | 0.157206 | 157080 | 106 | 0.010369 | DeepLearning |
| XGBoost_3_AutoML_1_20230317_151130 | 1228.5 | 1.50921e+06 | 1087.23 | 0.302479 | 1.50921e+06 | 687 | 0.005119 | XGBoost |
[20 rows x 9 columns]
Residual Analysis plots the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, or not accounting for heteroscedasticity or autocorrelation. Note that "striped" lines of residuals are an artifact of an integer-valued (vs. real-valued) response variable.
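The residual computation behind this plot can be sketched with scikit-learn; a synthetic dataset stands in for the H2O test frame, and the plotting call is left commented for a headless run:

```python
# Sketch: fitted-values-vs-residuals for any regressor (synthetic data for
# illustration; the notebook itself would use the bike-sharing test frame).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

reg = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)
fitted = reg.predict(X_test)
residuals = y_test - fitted

# Well-behaved residuals scatter around zero with no visible trend.
print(f"mean residual: {residuals.mean():.2f}")

# import matplotlib.pyplot as plt
# plt.scatter(fitted, residuals, s=10); plt.axhline(0, color="red")
# plt.xlabel("Fitted values"); plt.ylabel("Residuals"); plt.show()
```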
The learning curve plot shows the loss function/metric as a function of the number of iterations (or trees, for tree-based algorithms). This plot can be useful for determining whether the model overfits.
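The same overfitting check can be sketched with scikit-learn's `staged_predict`, which yields the test-set error after each boosting iteration (synthetic data here; the dataset and model are stand-ins, not the H2O objects above):

```python
# Sketch: a per-iteration learning curve for a GBM. A test error that bottoms
# out and then rises again is the classic sign of overfitting.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = GradientBoostingRegressor(n_estimators=200, random_state=0)
reg.fit(X_train, y_train)

# Test-set MSE after each added tree.
test_mse = [mean_squared_error(y_test, y_pred)
            for y_pred in reg.staged_predict(X_test)]
best_iter = int(np.argmin(test_mse))
print(f"lowest test MSE at iteration {best_iter + 1} of {len(test_mse)}")
```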
The variable importance plot shows the relative importance of the most important variables in the model.
The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g., Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, we summarize the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
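A tiny version of such a heatmap can be sketched by stacking the `feature_importances_` of two tree-based scikit-learn models into one table (synthetic data and column names are illustrative, not taken from the bike-sharing frame):

```python
# Sketch: comparing variable importance across two models, the raw table
# behind an importance heatmap.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
cols = [f"x{i}" for i in range(5)]

gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
drf = RandomForestRegressor(random_state=0).fit(X, y)

# One row per feature, one column per model; each column sums to 1.
imp = pd.DataFrame({"GBM": gbm.feature_importances_,
                    "DRF": drf.feature_importances_}, index=cols)
print(imp.round(3))
```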
This plot shows the correlation between the predictions of the models. For classification, the frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit, are highlighted in red text.
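For regression, the underlying quantity is simply the Pearson correlation between two models' prediction vectors, which can be sketched as follows (synthetic data; the models are stand-ins for the AutoML members):

```python
# Sketch: prediction correlation between two regressors, one entry of the
# model-correlation heatmap.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

preds_glm = LinearRegression().fit(X_train, y_train).predict(X_test)
preds_drf = RandomForestRegressor(random_state=2).fit(X_train, y_train).predict(X_test)

# Pearson correlation of the two prediction vectors on the same test rows.
corr = np.corrcoef(preds_glm, preds_drf)[0, 1]
print(f"prediction correlation: {corr:.3f}")
```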
The SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term equals the raw prediction of the model, i.e., the prediction before applying the inverse link function.
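The additivity property described above can be verified exactly for a linear model, where each feature's contribution is `coef_i * x_i` and the bias term is the intercept (this is an illustration of the property, not how H2O computes SHAP values for tree models):

```python
# Sketch: feature contributions + bias = raw prediction, shown for a linear
# model where the decomposition is exact and easy to compute by hand.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 3.0

model = LinearRegression().fit(X, y)
row = X[0]

contributions = model.coef_ * row          # one contribution per feature
raw_prediction = contributions.sum() + model.intercept_

# Matches model.predict exactly.
assert np.isclose(raw_prediction, model.predict(row.reshape(1, -1))[0])
```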
A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response, measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the remaining features.
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. This function plots the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there are strong feature interactions.
import h2o
from h2o.automl import H2OAutoML
h2o.init()
# Import data
f = "/content/drive/MyDrive/Qualities in Intelligent Systems/Bike-Sharing-Dataset/hour.csv"
df = h2o.import_file(f)
# Response column
y = "cnt"
# Split into train & test
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)
# Explain leader model & compare with all AutoML models
exa = aml.explain(test)
# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
| H2O_cluster_uptime: | 6 mins 26 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.40.0.2 |
| H2O_cluster_version_age: | 8 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_kavmgw |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 2.965 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| Python_version: | 3.9.16 final |
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
The leaderboard shows models with their metrics. When given an H2OAutoML object, it shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the provided frame. At most 20 models are shown by default.
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance | training_time_ms | predict_time_per_row_ms | algo |
|---|---|---|---|---|---|---|---|---|
| StackedEnsemble_AllModels_1_AutoML_2_20230317_151751 | 2.11301 | 4.46482 | 1.68988 | nan | 4.46482 | 474 | 0.013401 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_2_AutoML_2_20230317_151751 | 2.11301 | 4.46482 | 1.68988 | nan | 4.46482 | 449 | 0.007542 | StackedEnsemble |
| StackedEnsemble_BestOfFamily_1_AutoML_2_20230317_151751 | 2.11685 | 4.48106 | 1.69214 | nan | 4.48106 | 701 | 0.00787 | StackedEnsemble |
| GLM_1_AutoML_2_20230317_151751 | 3.08055 | 9.48976 | 2.1927 | nan | 9.48976 | 289 | 0.00121 | GLM |
| XGBoost_1_AutoML_2_20230317_151751 | 7.14944 | 51.1145 | 4.704 | 0.0928143 | 51.1145 | 2942 | 0.004964 | XGBoost |
| GBM_1_AutoML_2_20230317_151751 | 12.3146 | 151.649 | 7.45037 | 0.149081 | 151.649 | 3393 | 0.01313 | GBM |
| DRF_1_AutoML_2_20230317_151751 | 21.6649 | 469.367 | 10.9212 | 0.147284 | 469.367 | 441 | 0.001548 | DRF |
| GBM_3_AutoML_2_20230317_151751 | 96.1387 | 9242.64 | 74.8388 | 1.19714 | 9242.64 | 300 | 0.002984 | GBM |
| GBM_4_AutoML_2_20230317_151751 | 117.481 | 13801.8 | 91.5657 | 1.30082 | 13801.8 | 166 | 0.00264 | GBM |
| GBM_2_AutoML_2_20230317_151751 | 128.581 | 16533 | 100.298 | 1.35206 | 16533 | 154 | 0.001923 | GBM |
| XGBoost_3_AutoML_2_20230317_151751 | 180.168 | 32460.7 | 130.209 | 1.13754 | 32460.7 | 53 | 0.000492 | XGBoost |
| XGBoost_2_AutoML_2_20230317_151751 | 180.685 | 32647.2 | 130.09 | 1.12879 | 32647.2 | 593 | 0.000833 | XGBoost |
[12 rows x 9 columns]
Residual Analysis plots the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, or not accounting for heteroscedasticity or autocorrelation. Note that "striped" lines of residuals are an artifact of an integer-valued (vs. real-valued) response variable.
The learning curve plot shows the loss function/metric as a function of the number of iterations (or trees, for tree-based algorithms). This plot can be useful for determining whether the model overfits.
The variable importance plot shows the relative importance of the most important variables in the model.
The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g., Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, we summarize the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
This plot shows the correlation between the predictions of the models. For classification, the frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit, are highlighted in red text.
The SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term equals the raw prediction of the model, i.e., the prediction before applying the inverse link function.
A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response, measured as the change in the mean response. PDP assumes independence between the feature for which the PDP is computed and the remaining features.
An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. This function plots the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there are strong feature interactions.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
# Import data
df = pd.read_csv("/content/drive/MyDrive/Qualities in Intelligent Systems/Bike-Sharing-Dataset/day.csv")
# Response column
y = df["cnt"]
# Drop the target from the features, along with the record index, the raw date
# string, and the two columns that sum to cnt (casual + registered), which
# would leak the target
X = df.drop(columns=["cnt", "instant", "dteday", "casual", "registered"])
# Split into train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)
# Predict a single test row and score the model on the held-out set
reg.predict(X_test[1:2])
reg.score(X_test, y_test)
To understand how various environmental conditions affect the number of bikes rented, we can create different plots to visualize the relationship between the input features and the target variable (i.e., the number of bikes rented).